22 research outputs found
Quasi-Newton Steps for Efficient Online Exp-Concave Optimization
The aim of this paper is to design computationally-efficient and optimal
algorithms for the online and stochastic exp-concave optimization settings.
Typical algorithms for these settings, such as the Online Newton Step (ONS),
can guarantee an $O(d \ln T)$ bound on their regret after $T$ rounds, where
$d$ is the dimension of the feasible set. However, such algorithms perform
so-called generalized projections whenever their iterates step outside the
feasible set. Such generalized projections require $\Omega(d^3)$ arithmetic
operations even for simple sets such as a Euclidean ball, making the total runtime
of ONS of order $d^3 T$ after $T$ rounds, in the worst case. In this paper, we
side-step generalized projections by using a self-concordant barrier as a
regularizer to compute the Newton steps. This ensures that the iterates are
always within the feasible set without requiring projections. This approach
still requires the computation of the inverse of the Hessian of the barrier at
every step. However, using the stability properties of the Newton steps, we
show that the inverse of the Hessians can be efficiently approximated via
Taylor expansions for most rounds, resulting in a $\widetilde{O}(d^2 T + d^{\omega}\sqrt{T})$
total computational complexity, where $\omega$ is the exponent of matrix
multiplication. In the stochastic setting, we show that this translates into a
$\widetilde{O}(d^3/\epsilon)$ computational complexity for finding an $\epsilon$-suboptimal
point, answering an open question by Koren (2013). We first show these new
results for the simple case where the feasible set is a Euclidean ball. Then,
to move to general convex sets, we use a reduction to Online Convex Optimization
over the Euclidean ball. Our final algorithm can be viewed as a more efficient
version of ONS.
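To make the projection-free idea concrete, here is a minimal, illustrative Python sketch of a barrier-regularized Newton-type step over the unit Euclidean ball, where the Hessian inverse of the log-barrier is recomputed only occasionally and otherwise corrected via a first-order Taylor (Neumann) expansion applied with matrix-vector products. The step size, refresh schedule, and function names are assumptions made for this example; this is not the paper's exact algorithm.

    import numpy as np

    def barrier_grad_hess(x):
        # Gradient and Hessian of the self-concordant barrier R(x) = -log(1 - ||x||^2)
        # of the open unit Euclidean ball.
        s = 1.0 - x @ x
        grad = 2.0 * x / s
        hess = (2.0 / s) * np.eye(x.size) + (4.0 / s**2) * np.outer(x, x)
        return grad, hess

    def barrier_newton_steps(grad_fns, d, eta=0.1, refresh_every=10):
        # grad_fns[t](x) returns the loss gradient observed at round t (a stand-in
        # for the online setting).  The barrier's curvature blows up at the boundary,
        # which, for suitably damped steps, keeps iterates inside without projecting.
        x = np.zeros(d)
        H_ref, H_ref_inv = None, None
        for t, grad_fn in enumerate(grad_fns):
            bgrad, H = barrier_grad_hess(x)
            v = grad_fn(x) + bgrad                        # regularized Newton direction RHS
            if t % refresh_every == 0:
                H_ref, H_ref_inv = H, np.linalg.inv(H)    # exact inverse: costly, but rare
                step = H_ref_inv @ v
            else:
                # First-order Taylor expansion of (H_ref + Delta)^{-1} applied to v,
                # using only matrix-vector products: O(d^2) for this round.
                Delta = H - H_ref
                u = H_ref_inv @ v
                step = u - H_ref_inv @ (Delta @ u)
            x = x - eta * step
        return x

    # Toy usage: quadratic losses 0.5 * ||x - a_t||^2 with targets inside the ball.
    rng = np.random.default_rng(0)
    targets = [0.5 * rng.standard_normal(5) / np.sqrt(5) for _ in range(200)]
    grads = [(lambda x, a=a: x - a) for a in targets]
    print(barrier_newton_steps(grads, d=5))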
Lipschitz Adaptivity with Multiple Learning Rates in Online Learning
We aim to design adaptive online learning algorithms that take advantage of
any special structure that might be present in the learning task at hand, with
as little manual tuning by the user as possible. A fundamental obstacle that
comes up in the design of such adaptive algorithms is to calibrate a so-called
step-size or learning rate hyperparameter depending on variance, gradient
norms, etc. A recent technique promises to overcome this difficulty by
maintaining multiple learning rates in parallel. This technique has been
applied in the MetaGrad algorithm for online convex optimization and the Squint
algorithm for prediction with expert advice. However, in both cases the user
still has to provide in advance a Lipschitz hyperparameter that bounds the norm
of the gradients. Although this hyperparameter is typically not available in
advance, tuning it correctly is crucial: if it is set too small, the methods
may fail completely; but if it is taken too large, performance deteriorates
significantly. In the present work we remove this Lipschitz hyperparameter by
designing new versions of MetaGrad and Squint that adapt to its optimal value
automatically. We achieve this by dynamically updating the set of active
learning rates. For MetaGrad, we further improve the computational efficiency
of handling constraints on the domain of prediction, and we remove the need to
specify the number of rounds in advance.
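As a concrete illustration of the multiple-learning-rates idea, the following Python sketch maintains a fixed geometric grid of learning rates and mixes Squint-style potentials over that grid to weight experts. It is not the paper's algorithm: the grid, the uniform prior, and the assumption that losses lie in [0, 1] are choices made for the example; the paper's contribution is precisely to grow such a grid on the fly so that no Lipschitz bound needs to be supplied in advance.

    import numpy as np

    def squint_like_weights(R, V, etas):
        # R[k]: cumulative instantaneous regret of expert k; V[k]: its cumulative square.
        # Mixture over learning rates: w_k proportional to
        #   sum_i eta_i * exp(eta_i * R_k - eta_i^2 * V_k).
        logpot = np.log(etas)[:, None] + np.outer(etas, R) - np.outer(etas**2, V)
        logpot -= logpot.max()                  # shift for numerical stability only
        w = np.exp(logpot).sum(axis=0)
        return w / w.sum()

    def play(losses, etas=None):
        # losses: array of shape (T, K), entries assumed to lie in [0, 1].
        if etas is None:
            etas = 0.5 ** np.arange(1, 10)      # geometric grid of learning rates
        T, K = losses.shape
        R, V = np.zeros(K), np.zeros(K)
        total = 0.0
        for t in range(T):
            w = squint_like_weights(R, V, etas)
            mix = w @ losses[t]                 # learner's mixture loss this round
            r = mix - losses[t]                 # instantaneous regrets vs each expert
            R, V, total = R + r, V + r**2, total + mix
        return total, R.max()                   # cumulative loss, regret vs best expert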
Lipschitz and Comparator-Norm Adaptivity in Online Learning
We study Online Convex Optimization in the unbounded setting where neither
predictions nor gradients are constrained. The goal is to simultaneously adapt
to both the sequence of gradients and the comparator. We first develop
parameter-free and scale-free algorithms for a simplified setting with hints.
We present two versions: the first adapts to the squared norms of both
comparator and gradients separately using $O(d)$ time per round, the second
adapts to their squared inner products (which measure variance only in the
comparator direction) in $O(d^2)$ time per round. We then generalize two prior
reductions to the unbounded setting: one so that it no longer needs hints, and a second to
deal with the range ratio problem (which already arises in prior work). We
discuss their optimality in light of prior and new lower bounds. We apply our
methods to obtain sharper regret bounds for scale-invariant online prediction
with linear models.
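For intuition about the hint-based, parameter-free building block, here is a standard one-dimensional coin-betting (Krichevsky-Trofimov) sketch in Python; it is a classical construction, not this paper's algorithm. It assumes a hint h with |g_t| <= h on every round, which is the kind of side information the simplified setting above provides, and an initial wealth eps chosen for the example; the paper's reductions then remove the hint and handle the resulting range-ratio issue.

    import numpy as np

    def kt_coin_betting(gradients, h, eps=1.0):
        # One-dimensional online linear optimization via coin betting.
        # gradients: sequence of g_t with |g_t| <= h (h is the 'hint').
        # Returns the sequence of predictions x_1, ..., x_T.
        wealth = eps          # initial wealth; the regret vs a comparator u scales with |u|
        sum_c = 0.0           # running sum of normalized 'coin outcomes' c_s = -g_s / h
        preds = []
        for t, g in enumerate(gradients, start=1):
            beta = sum_c / t                 # KT betting fraction, always in (-1, 1)
            bet = beta * wealth
            preds.append(bet / h)            # prediction for round t
            c = -g / h                       # normalized outcome in [-1, 1]
            wealth += c * bet                # wealth update; stays positive since |beta| < 1
            sum_c += c
        return preds

    # Example usage on a stream of bounded gradients.
    rng = np.random.default_rng(1)
    xs = kt_coin_betting(rng.uniform(-2.0, 2.0, size=1000), h=2.0)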
Adaptivity in Online and Statistical Learning
Many modern machine learning algorithms, though successful, are still based on heuristics. In a typical application, such heuristics may manifest in the choice of a specific Neural Network structure, its number of parameters, or the learning rate during training. Relying on these heuristics is not ideal from a computational perspective (often involving multiple runs of the algorithm), and can also lead to over-fitting in some cases. This motivates the following question: for which machine learning tasks/settings do there exist efficient algorithms that automatically adapt to the best parameters?

Characterizing the settings where this is the case and designing corresponding (parameter-free) algorithms within the online learning framework constitutes one of this thesis' primary goals. Towards this end, we develop algorithms for constrained and unconstrained online convex optimization that can automatically adapt to various parameters of interest such as the Lipschitz constant, the curvature of the sequence of losses, and the norm of the comparator. We also derive new performance lower bounds characterizing the limits of adaptivity for algorithms in these settings.

Part of systematizing the choice of machine learning methods also involves having ``certificates'' for the performance of algorithms. In the statistical learning setting, this translates to having (tight) generalization bounds. Adaptivity can manifest here through data-dependent bounds that become small whenever the problem is ``easy''. In this thesis, we provide such data-dependent bounds for the expected loss (the standard risk measure) and other risk measures. We also explore how such bounds can be used in the context of risk-monotonicity.
PAC-Bayesian Bound for the Conditional Value at Risk
Conditional Value at Risk (CVaR) is a family of "coherent risk measures"
which generalize the traditional mathematical expectation. Widely used in
mathematical finance, it is garnering increasing interest in machine learning,
e.g., as an alternate approach to regularization, and as a means for ensuring
fairness. This paper presents a generalization bound for learning algorithms
that minimize the CVaR of the empirical loss. The bound is of PAC-Bayesian type
and is guaranteed to be small when the empirical CVaR is small. We achieve this
by reducing the problem of estimating CVaR to that of merely estimating an
expectation. This then enables us, as a by-product, to obtain concentration
inequalities for CVaR even when the random variable in question is unbounded.
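For reference, here is a minimal Python sketch of the empirical Conditional Value at Risk at level alpha, i.e. the average of the worst alpha-fraction of observed losses (one common convention; the function name and this simple tail-averaging estimator are assumptions for the example, not the paper's exact estimator).

    import numpy as np

    def empirical_cvar(losses, alpha):
        # Average of the largest ceil(alpha * n) losses.  Equivalently, CVaR admits
        # the variational form min_theta { theta + E[(Z - theta)_+] / alpha }
        # (Rockafellar-Uryasev), which expresses it via a plain expectation.
        losses = np.asarray(losses, dtype=float)
        n = losses.size
        k = max(1, int(np.ceil(alpha * n)))   # size of the alpha-tail
        tail = np.sort(losses)[-k:]           # the k worst (largest) losses
        return tail.mean()

    # Example: for losses uniform on [0, 1], CVaR at alpha = 0.1 is about 0.95.
    print(empirical_cvar(np.random.default_rng(0).random(100000), alpha=0.1))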
PAC-Bayes Unexpected Bernstein Inequality
We present a new PAC-Bayesian generalization bound. Standard bounds contain a $\sqrt{L_n \cdot \mathrm{KL}/n}$ complexity term which dominates unless $L_n$, the empirical error of the learning algorithm's randomized predictions, vanishes. We manage to replace $L_n$ by a term which vanishes in many more situations, essentially whenever the employed learning algorithm is sufficiently stable on the dataset at hand. Our new bound consistently beats state-of-the-art bounds both on a toy example and on UCI datasets (with large enough $n$). Theoretically, unlike existing bounds, our new bound can be expected to converge to $0$ faster whenever a Bernstein/Tsybakov condition holds, thus connecting PAC-Bayesian generalization and {\em excess risk\/} bounds---for the latter it has long been known that faster convergence can be obtained under Bernstein conditions. Our main technical tool is a new concentration inequality which is like Bernstein's but with $X^2$ taken outside its expectation.
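A small numeric illustration of the point above: the slow-rate term $\sqrt{L_n \cdot \mathrm{KL}/n}$ dominates the fast-rate term $\mathrm{KL}/n$ unless $L_n$ is itself of order $\mathrm{KL}/n$. The particular values of KL, n, and L_n below are arbitrary choices for the example.

    import math

    KL, n = 10.0, 10_000
    for L_n in (0.1, 0.01, 0.001, 0.0):
        slow = math.sqrt(L_n * KL / n)   # standard PAC-Bayes complexity term
        fast = KL / n                    # the rate achievable when L_n vanishes
        print(f"L_n={L_n:6.3f}  sqrt(L_n*KL/n)={slow:.4f}  KL/n={fast:.4f}")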